{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Predicting music recommendations with LightGBM: model training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the required modules first\n",
    "import pandas as pd \n",
    "import numpy as np\n",
    "import pickle as pk\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import math\n",
    "import scipy.io as sio\n",
    "import scipy.sparse as ss\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lightgbm as lgbm\n",
    "from lightgbm.sklearn import LGBMClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path = '../Data/'  # data directory\n",
    "model_path = '../model/'  # model directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the label data\n",
    "with open(model_path + 'target_list.pkl', 'rb') as fr:\n",
    "    train_Y = pk.load(fr)  # the with block closes the file on exit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# with open(model_path + 'data_all_train_lgbm_v1.pkl','wb') as fw:\n",
    "#     pk.dump(train_X,fw)\n",
    "# fw.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the preprocessed training data\n",
    "with open(model_path + 'data_all_train_lgbm_v1.pkl', 'rb') as fr:\n",
    "    train_X = pk.load(fr)  # the with block closes the file on exit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LightGBM hyperparameter tuning"
   ]
  },
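  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main LightGBM hyperparameters are:\n",
    "\n",
    "1. The number of trees `n_estimators` and the learning rate `learning_rate`.\n",
    "2. The maximum tree depth `max_depth` and the maximum number of leaves `num_leaves`. XGBoost is usually tuned through `max_depth` alone, but LightGBM grows trees leaf-wise, so `num_leaves` matters and should be set below 2^max_depth.\n",
    "3. The minimum number of samples in a leaf: `min_data_in_leaf` (aliases `min_data`, `min_child_samples`).\n",
    "4. The column sampling ratio for each tree: `feature_fraction` / `colsample_bytree`.\n",
    "5. The row sampling ratio for each tree: `bagging_fraction` / `subsample` (`bagging_freq=1` must be set as well for it to take effect).\n",
    "6. The regularization parameters `lambda_l1` (`reg_alpha`) and `lambda_l2` (`reg_lambda`).\n",
    "7. Two parameters that do not control model complexity but do affect speed and accuracy. They can be adjusted to the range of feature values and the number of samples:\n",
    "   - `max_bin`, the maximum number of bins per feature: default 255;\n",
    "   - `subsample_for_bin`, the number of samples used to build the histograms: default 200000.\n",
    "\n",
    "`n_estimators` is tuned with LightGBM's built-in cv function. As in XGBoost, a single cv run evaluates every boosting round and can stop early, so it is far faster than a grid search over the number of trees.\n",
    "The other parameters are tuned with GridSearchCV."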
   ]
  },
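  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `num_leaves` guideline in item 2 can be checked with a little arithmetic. The cell below is a minimal sketch: `max_leaves_for_depth` is a hypothetical helper written for this illustration, not a LightGBM function, and the (max_depth, num_leaves) pairs are the ones tried in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper: a binary tree of depth d has at most 2**d leaves\n",
    "def max_leaves_for_depth(max_depth):\n",
    "    return 2 ** max_depth\n",
    "\n",
    "# (max_depth, num_leaves) pairs used elsewhere in this notebook\n",
    "for depth, leaves in [(6, 60), (7, 70), (7, 20)]:\n",
    "    bound = max_leaves_for_depth(depth)\n",
    "    print('max_depth=%d num_leaves=%d bound=%d within_bound=%s' % (depth, leaves, bound, leaves < bound))"
   ]
  },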
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "MAX_ROUNDS = 10000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Shared cross-validation splits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prepare cross validation\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "\n",
    "kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. n_estimators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Call LightGBM's built-in cross-validation (cv) directly: it evaluates every\n",
    "# boosting round in one pass, so n_estimators can be tuned quickly.\n",
    "# GridSearchCV can only try a finite list of values and is comparatively slow.\n",
    "def get_n_estimators(params, train_X, train_Y, early_stopping_rounds=10):\n",
    "    lgbm_params = params.copy()\n",
    "\n",
    "    lgbmtrain = lgbm.Dataset(train_X, train_Y)\n",
    "\n",
    "    # num_boost_round is the maximum number of weak learners. Because\n",
    "    # early_stopping_rounds is set, training stops once the metric has not\n",
    "    # improved for that many consecutive rounds.\n",
    "    cv_result = lgbm.cv(lgbm_params, lgbmtrain, num_boost_round=MAX_ROUNDS, nfold=3,\n",
    "                        metrics='auc', early_stopping_rounds=early_stopping_rounds, seed=3)\n",
    "\n",
    "    # the cv history is returned; its length is the number of rounds kept\n",
    "    return cv_result"
   ]
  },
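  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `early_stopping_rounds` argument used above belongs to older LightGBM releases; newer releases take early stopping as a callback instead. The cell below is a hedged sketch of the same tuning idea on synthetic data, not the run that produced this notebook's numbers. `X_demo`, `y_demo`, `demo_params` and the key lookup are assumptions made for illustration; the lookup is there because the name of the AUC entry in the result differs between releases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lightgbm as lgbm\n",
    "from sklearn.datasets import make_classification\n",
    "\n",
    "# Synthetic binary data (assumption: stands in for train_X / train_Y)\n",
    "X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=3)\n",
    "demo_params = {'objective': 'binary', 'learning_rate': 0.1, 'num_leaves': 15, 'verbose': -1}\n",
    "\n",
    "demo_cv = lgbm.cv(\n",
    "    demo_params,\n",
    "    lgbm.Dataset(X_demo, y_demo),\n",
    "    num_boost_round=200,\n",
    "    nfold=3,\n",
    "    stratified=True,\n",
    "    metrics='auc',\n",
    "    seed=3,\n",
    "    # stop once the mean AUC has not improved for 10 rounds\n",
    "    callbacks=[lgbm.early_stopping(stopping_rounds=10, verbose=False)],\n",
    ")\n",
    "\n",
    "# Find the mean-AUC history whatever prefix this release gives the key\n",
    "auc_key = next(k for k in demo_cv if k.endswith('auc-mean'))\n",
    "demo_n_estimators = len(demo_cv[auc_key])\n",
    "print(auc_key, demo_n_estimators)"
   ]
  },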
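  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### To find n_estimators, the other important parameters need initial values, such as those in params.\n",
    "The initial values matter little in themselves; they only provide a starting point for tuning the remaining parameters."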
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%time\n",
    "# params = {'boosting_type': 'gbdt',\n",
    "#           'objective': 'regression',\n",
    "#           'n_jobs': 6,\n",
    "#           'learning_rate': 0.1,\n",
    "#           'num_leaves': 60,\n",
    "#           'max_depth': 6,\n",
    "#           'max_bin': 127, # 2^7 - 1; the raw features are integers that rarely exceed 100\n",
    "#           'subsample': 0.7,\n",
    "#           'bagging_freq': 1,\n",
    "#           'colsample_bytree': 0.7,\n",
    "#          }\n",
    "\n",
    "# cv_result = get_n_estimators(params , train_X , train_Y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cv_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print('best_score: %f' % cv_result['auc-mean'][-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# best_n_estimators = len(cv_result['auc-mean'])\n",
    "# print(\"best estimator's num: %d\" % best_n_estimators)\n",
    "best_n_estimators = 9431"
   ]
  },
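  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. num_leaves & max_depth=7\n",
    "The suggested num_leaves is 70-80 and the search range is 50-80; larger values make the model more complex and more prone to overfitting.\n",
    "max_depth is raised to 7 to match."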
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# num_leaves_s = range(50,90,10) #50,60,70,80\n",
    "# tuned_parameters = dict(num_leaves = num_leaves_s)\n",
    "# print(tuned_parameters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%time\n",
    "# params = {'boosting_type': 'gbdt',\n",
    "#           'objective': 'regression',\n",
    "#           'n_jobs': 6,\n",
    "#           'learning_rate': 0.01,\n",
    "#           'min_child_samples':20,\n",
    "#           'n_estimators':best_n_estimators,\n",
    "#           'max_depth': 7,\n",
    "#           'max_bin': 64, # the raw features are integers that rarely exceed 100\n",
    "#           'subsample': 0.7,\n",
    "#           'bagging_freq': 1,\n",
    "#           'colsample_bytree': 0.5,\n",
    "#           'num_leaves': 30,\n",
    "#          }\n",
    "# lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "# # num_leaves_s = range(10,50,20) #50,60,70,80\n",
    "# # tuned_parameters = dict( num_leaves = num_leaves_s)\n",
    "# lg.fit(train_X,train_Y)\n",
    "# # print('tuned_parameters',tuned_parameters)\n",
    "# # grid_search_num_leaves = GridSearchCV(lg, n_jobs=6, param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=10, refit = False)\n",
    "# # grid_search_num_leaves.fit(train_X , train_Y)\n",
    "# #grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try:\n",
    "#     with open(model_path + 'lg_v2_2_1.pkl','wb') as fw:\n",
    "#         pk.dump(lg,fw)\n",
    "#     fw.close()\n",
    "# except Exception as e:\n",
    "#     print('dump lg_v2_2_1.pkl error,',e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%time\n",
    "# params = {'boosting_type': 'gbdt',\n",
    "#           'objective': 'regression',\n",
    "#           'n_jobs': 6,\n",
    "#           'learning_rate': 0.01,\n",
    "#           'min_child_samples':20,\n",
    "#           'n_estimators':best_n_estimators,\n",
    "#           'max_depth': 7,\n",
    "#           'max_bin': 64, # the raw features are integers that rarely exceed 100\n",
    "#           'subsample': 0.7,\n",
    "#           'bagging_freq': 1,\n",
    "#           'colsample_bytree': 0.8,\n",
    "#           'num_leaves': 30,\n",
    "#          }\n",
    "# lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "# # num_leaves_s = range(10,50,20) #50,60,70,80\n",
    "# # tuned_parameters = dict( num_leaves = num_leaves_s)\n",
    "# lg.fit(train_X,train_Y)\n",
    "# # print('tuned_parameters',tuned_parameters)\n",
    "# # grid_search_num_leaves = GridSearchCV(lg, n_jobs=6, param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=10, refit = False)\n",
    "# # grid_search_num_leaves.fit(train_X , train_Y)\n",
    "# #grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try:\n",
    "#     with open(model_path + 'lg_v2_2_2.pkl','wb') as fw:\n",
    "#         pk.dump(lg,fw)\n",
    "#     fw.close()\n",
    "# except Exception as e:\n",
    "#     print('dump lg_v2_2_2.pkl error,',e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 9h 36min 26s, sys: 21min 23s, total: 9h 57min 50s\n",
      "Wall time: 2h 45min 59s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "LGBMClassifier(bagging_freq=1, boosting_type='gbdt', class_weight=None,\n",
       "               colsample_bytree=0.8, importance_type='split',\n",
       "               learning_rate=0.01, max_bin=64, max_depth=7,\n",
       "               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,\n",
       "               n_estimators=9431, n_jobs=6, num_leaves=20,\n",
       "               objective='regression', random_state=None, reg_alpha=0.0,\n",
       "               reg_lambda=0.0, silent=False, subsample=0.6,\n",
       "               subsample_for_bin=200000, subsample_freq=0)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "# Compared with lg_v2_2_2.pkl, subsample is reduced\n",
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'regression',\n",
    "          'n_jobs': 6,\n",
    "          'learning_rate': 0.01,\n",
    "          'min_child_samples':20, # minimum number of samples per leaf\n",
    "          'n_estimators':best_n_estimators,\n",
    "          'max_depth': 7,\n",
    "          'max_bin': 64, # the raw features are integers that rarely exceed 100\n",
    "          'subsample': 0.6,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.8,\n",
    "          'num_leaves': 20,\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "# num_leaves_s = range(10,50,20) #50,60,70,80\n",
    "# tuned_parameters = dict( num_leaves = num_leaves_s)\n",
    "lg.fit(train_X,train_Y)\n",
    "# print('tuned_parameters',tuned_parameters)\n",
    "# grid_search_num_leaves = GridSearchCV(lg, n_jobs=6, param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=10, refit = False)\n",
    "# grid_search_num_leaves.fit(train_X , train_Y)\n",
    "#grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    with open(model_path + 'lg_v2_2_3.pkl','wb') as fw:\n",
    "        pk.dump(lg,fw)\n",
    "    fw.close()\n",
    "except Exception as e:\n",
    "    print('dump lg_v2_2_3.pkl error,',e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# # examine the best model\n",
    "# print(-grid_search.best_score_)\n",
    "# print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# # plot the CV error curve\n",
    "# test_means = grid_search.cv_results_[ 'mean_test_score' ]\n",
    "# test_stds = grid_search.cv_results_[ 'std_test_score' ]\n",
    "# train_means = grid_search.cv_results_[ 'mean_train_score' ]\n",
    "# train_stds = grid_search.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "# n_leafs = len(num_leaves_s)\n",
    "\n",
    "# x_axis = num_leaves_s\n",
    "# plt.plot(x_axis, -test_means)\n",
    "# #plt.errorbar(x_axis, -test_means, yerr=test_stds,label = ' Test')\n",
    "# #plt.errorbar(x_axis, -train_means, yerr=train_stds,label = ' Train')\n",
    "# plt.xlabel( 'num_leaves' )\n",
    "# plt.ylabel( 'Log Loss' )\n",
    "# plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# test_means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Performance fluctuates; take the recommended value of 70, no further fine-tuning needed"
   ]
  },
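  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. min_child_samples\n",
    "The minimum number of samples in a leaf.\n",
    "\n",
    "Number of leaves: 70 across 9 classes, so about 8 leaves per class on average.\n",
    "\n",
    "Samples per tree for the rarest class (rare events): 200 * 2/3 * 0.7 ≈ 93, or roughly 100.\n",
    "\n",
    "So each leaf holds about 100/8 ≈ 12 samples.\n",
    "\n",
    "Search range: 10-50"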
   ]
  },
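  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The estimate above can be reproduced in a few lines. This is a minimal sketch with variable names chosen for illustration. Reading the 2/3 factor as the training share of a 3-fold split and the 0.7 factor as the row subsample ratio is an assumption, not something the text states."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "num_leaves = 70\n",
    "num_classes = 9\n",
    "rare_class_samples = 200\n",
    "train_share = 2 / 3   # assumption: training share of a 3-fold split\n",
    "subsample = 0.7       # assumption: row subsample ratio\n",
    "\n",
    "# Leaves available to each class, rounded up\n",
    "leaves_per_class = math.ceil(num_leaves / num_classes)\n",
    "# Rare-class samples that reach a single tree\n",
    "samples_per_tree = rare_class_samples * train_share * subsample\n",
    "# Rare-class samples per leaf, which motivates the search range\n",
    "samples_per_leaf = samples_per_tree / leaves_per_class\n",
    "\n",
    "print(leaves_per_class, round(samples_per_tree), round(samples_per_leaf))"
   ]
  },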
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'regression',\n",
    "          'n_jobs': 6,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':best_n_estimators,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'max_bin': 32, # there are 167 sparse discrete features; keep this small for now to speed up training\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.7,\n",
    "          # early_stopping_round is left out: early stopping needs a validation set,\n",
    "          # and GridSearchCV does not pass an eval_set to fit\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "min_child_samples_s = range(10,50,10)  # 10, 20, 30, 40\n",
    "tuned_parameters = dict( min_child_samples = min_child_samples_s)\n",
    "\n",
    "# return_train_score=True so that cv_results_ also records mean_train_score\n",
    "grid_search_v1 = GridSearchCV(lg, n_jobs=4,  param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False, return_train_score=True)\n",
    "grid_search_v1.fit(train_X , train_Y)\n",
    "# with refit=False there is no best_estimator_; best_params_ is still available\n",
    "grid_search_v1.best_params_"
   ]
  },
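  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two GridSearchCV details matter for the cells that follow: `mean_train_score` is recorded only when `return_train_score=True`, and `best_estimator_` exists only when refitting is left on. The cell below is a small sketch on synthetic data. To keep it fast, scikit-learn's `GradientBoostingClassifier` stands in for `LGBMClassifier` and `min_samples_leaf` stands in for `min_child_samples`; both substitutions, and every `demo_` name, are assumptions made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "from sklearn.model_selection import GridSearchCV, StratifiedKFold\n",
    "\n",
    "# Synthetic binary data (assumption: stands in for train_X / train_Y)\n",
    "X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=3)\n",
    "demo_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)\n",
    "\n",
    "demo_search = GridSearchCV(\n",
    "    GradientBoostingClassifier(n_estimators=20, random_state=3),\n",
    "    param_grid={'min_samples_leaf': [10, 20, 30, 40]},\n",
    "    cv=demo_kfold,\n",
    "    scoring='neg_log_loss',\n",
    "    return_train_score=True,  # needed for mean_train_score\n",
    "    refit=False,              # so no best_estimator_ is created\n",
    ")\n",
    "demo_search.fit(X_demo, y_demo)\n",
    "\n",
    "print(demo_search.best_params_)\n",
    "print(-demo_search.cv_results_['mean_test_score'])   # validation log loss per setting\n",
    "print(-demo_search.cv_results_['mean_train_score'])  # training log loss per setting"
   ]
  },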
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# examine the best model\n",
    "print(-grid_search_v1.best_score_)\n",
    "print(grid_search_v1.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the CV error curve\n",
    "test_means = grid_search_v1.cv_results_[ 'mean_test_score' ]\n",
    "test_stds = grid_search_v1.cv_results_[ 'std_test_score' ]\n",
    "train_means = grid_search_v1.cv_results_[ 'mean_train_score' ]\n",
    "train_stds = grid_search_v1.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "x_axis = min_child_samples_s\n",
    "\n",
    "# scores are negative log loss, so negate them to plot the loss itself\n",
    "plt.plot(x_axis, -test_means)\n",
    "#plt.errorbar(x_axis, -test_means, yerr=test_stds, label='Test')\n",
    "#plt.errorbar(x_axis, -train_means, yerr=train_stds, label='Train')\n",
    "plt.xlabel( 'min_child_samples' )\n",
    "plt.ylabel( 'Log Loss' )\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "test_means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### min_child_samples=30"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Row sampling parameter: subsample/bagging_fraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 2,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':best_n_estimators,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'min_child_samples':30,\n",
    "          'max_bin': 127, # maximum number of bins when building the feature histograms\n",
    "          #'subsample': 0.7,\n",
    "          'bagging_freq': 1, # bagging frequency: resample the rows every k iterations (here k = 1)\n",
    "          'colsample_bytree': 0.7, # fraction of features sampled for each tree\n",
    "          # 'sparse_threshold': 0.8  # sparsity threshold (a feature whose share of zeros exceeds it counts as sparse)\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "subsample_s = [i/10.0 for i in range(5,10)]\n",
    "tuned_parameters = dict( subsample = subsample_s)\n",
    "\n",
    "# return_train_score=True so that cv_results_ also records mean_train_score\n",
    "grid_search = GridSearchCV(lg, n_jobs=4,  param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False, return_train_score=True)\n",
    "grid_search.fit(train_X , train_Y)\n",
    "#grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# examine the best model\n",
    "print(-grid_search.best_score_)\n",
    "print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the CV error curve\n",
    "test_means = grid_search.cv_results_[ 'mean_test_score' ]\n",
    "test_stds = grid_search.cv_results_[ 'std_test_score' ]\n",
    "train_means = grid_search.cv_results_[ 'mean_train_score' ]\n",
    "train_stds = grid_search.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "x_axis = subsample_s\n",
    "\n",
    "plt.plot(x_axis, -test_means)\n",
    "#plt.errorbar(x_axis, -test_scores[:,i], yerr=test_stds[:,i] ,label = str(max_depths[i]) +' Test')\n",
    "#plt.errorbar(x_axis, -train_scores[:,i], yerr=train_stds[:,i] ,label = str(max_depths[i]) +' Train')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### subsample=0.8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Column sampling parameter: sub_feature/feature_fraction/colsample_bytree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 2,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':best_n_estimators,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'min_child_samples':30,\n",
    "          'max_bin': 127, # 2^7 - 1; the raw features are integers that rarely exceed 100\n",
    "          'subsample': 0.8,  \n",
    "          'bagging_freq': 1,\n",
    "          #'colsample_bytree': 0.7,\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "colsample_bytree_s = [i/10.0 for i in range(5,10)]\n",
    "tuned_parameters = dict( colsample_bytree = colsample_bytree_s)\n",
    "\n",
    "# return_train_score=True so that cv_results_ also records mean_train_score\n",
    "grid_search = GridSearchCV(lg, n_jobs=4,  param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False, return_train_score=True)\n",
    "grid_search.fit(train_X , train_Y)\n",
    "#grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# examine the best model\n",
    "print(-grid_search.best_score_)\n",
    "print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# plot the CV error curve\n",
    "test_means = grid_search.cv_results_[ 'mean_test_score' ]\n",
    "test_stds = grid_search.cv_results_[ 'std_test_score' ]\n",
    "train_means = grid_search.cv_results_[ 'mean_train_score' ]\n",
    "train_stds = grid_search.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "x_axis = colsample_bytree_s\n",
    "\n",
    "plt.plot(x_axis, -test_means)\n",
    "#plt.errorbar(x_axis, -test_scores[:,i], yerr=test_stds[:,i] ,label = str(max_depths[i]) +' Test')\n",
    "#plt.errorbar(x_axis, -train_scores[:,i], yerr=train_stds[:,i] ,label = str(max_depths[i]) +' Train')\n",
    "\n",
    "plt.show()"
   ]
  },
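  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try smaller values as well: the feature set combines the raw features with the tf-idf features, so there are rather a lot of them."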
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 2,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':best_n_estimators,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'min_child_samples':30,\n",
    "          'max_bin': 127, # 2^7 - 1; the raw features are integers that rarely exceed 100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          #'colsample_bytree': 0.7,\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "colsample_bytree_s = [i/10.0 for i in range(3,5)]  # 0.3, 0.4\n",
    "tuned_parameters = dict( colsample_bytree = colsample_bytree_s)\n",
    "\n",
    "grid_search = GridSearchCV(lg, n_jobs=4,  param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False)\n",
    "grid_search.fit(train_X , train_Y)\n",
    "#grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# examine the best model\n",
    "print(-grid_search.best_score_)\n",
    "print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### colsample_bytree=0.4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regularization parameters lambda_l1 (reg_alpha) and lambda_l2 (reg_lambda): these do not seem to need tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lower the learning rate and re-tune n_estimators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 2,\n",
    "          'learning_rate': 0.01,\n",
    "          #'n_estimators':n_estimators_1,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'min_child_samples':30,\n",
    "          'max_bin': 127, # 2^7 - 1; the raw features are integers that rarely exceed 100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.4,\n",
    "         }\n",
    "# get_n_estimators returns the cv history; its length is the number of rounds kept by early stopping\n",
    "cv_result_2 = get_n_estimators(params , train_X , train_Y)\n",
    "n_estimators_2 = len(cv_result_2['auc-mean'])"
   ]
  },
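  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrain the model on all training data with the best parameters\n",
    "With more samples available, should the model complexity be raised slightly?\n",
    "- num_leaves is increased by 5\n",
    "- min_child_samples is increased in proportion to the sample count, to 40"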
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 2,\n",
    "          'learning_rate': 0.01,\n",
    "          'n_estimators':n_estimators_2,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':75,\n",
    "          'min_child_samples':40,\n",
    "          'max_bin': 127, # 2^7 - 1; the raw features are integers that rarely exceed 100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.4,\n",
    "         }\n",
    "\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "lg.fit(train_X, train_Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save the model for later testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cPickle exists only in Python 2; the pickle module imported above as pk replaces it\n",
    "with open('Otto_LightGBM_org_tfidf.pkl', 'wb') as fw:\n",
    "    pk.dump(lg, fw)"
   ]
  },
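  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before relying on a saved file in a later notebook, it can help to confirm that it loads back. The cell below is a minimal round-trip sketch: `demo_model` is a plain dictionary standing in for the fitted classifier, and the temporary path is an assumption made so that the example leaves no files behind."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pickle as pk\n",
    "import tempfile\n",
    "\n",
    "# Stand-in object (assumption: replaces the fitted lg model)\n",
    "demo_model = {'num_leaves': 75, 'min_child_samples': 40}\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    demo_path = os.path.join(tmp_dir, 'demo_model.pkl')\n",
    "    # write the object, then read it back from the same file\n",
    "    with open(demo_path, 'wb') as fw:\n",
    "        pk.dump(demo_model, fw)\n",
    "    with open(demo_path, 'rb') as fr:\n",
    "        restored = pk.load(fr)\n",
    "\n",
    "print(restored == demo_model)"
   ]
  },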
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"columns\":list(feat_names), \"importance\":list(lg.feature_importances_.T)})\n",
    "df = df.sort_values(by=['importance'],ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "plt.bar(range(len(lg.feature_importances_)), lg.feature_importances_)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tf-idf features rank somewhat higher in importance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
